Lab 2: Exploring Image Data¶

Group mode:¶

Eric Houtman, Rick Lattin, Kellen Long

Important Citations¶

Kaggle Dataset: https://www.kaggle.com/datasets/andrewmvd/animal-faces?resource=download

Professor Larson's Github: https://github.com/eclarson/MachineLearningNotebooks

Extension Research: https://www.youtube.com/watch?v=yn1NUwaxhZg&t=8s

Business Analysis¶

Our dataset contains pictures of cats and dogs and could be used to classify animal images as one species or the other. The data was collected for exactly this kind of use: training models to identify images of cats and dogs and correctly label their species.

Intuitively, the prediction task is to classify each animal as either a dog or a cat, which could be useful to third parties such as vets, animal shelters, or pet care businesses. A classifier could take automated inventory for any of these industries, determining the number and distribution of cats and dogs in a given facility. It could also distinguish cats from dogs so that species-specific supplies, such as medicines formulated for only one of the two, are given to the right animals.

As for the required accuracy of the results, use cases involving medicine would demand an extremely high level of accuracy, while almost any other classification task could tolerate a broader margin of error.

In [5]:
#load the images from the folder as numpy arrays
import numpy as np
import matplotlib.pyplot as plt

#import os and cv2
import os
from os import listdir
import cv2

#set the path to the folder with the images
cat_folder = "/Users/erichoutman/Desktop/SMU/Spring 2023/CS 5324-ML/Lab 2/train/cat"
dog_folder = "/Users/erichoutman/Desktop/SMU/Spring 2023/CS 5324-ML/Lab 2/train/dog"

We imported all of the necessary libraries and set the paths to the folders containing the cat and dog images, respectively.

In [6]:
#read in the first 1500 images of the cat folder
cat_images = [cv2.imread(cat_folder + "/" + file) for file in listdir(cat_folder) if file.endswith(".jpg")][:1500]

#read in the first 1500 images of the dog folder
dog_images = [cv2.imread(dog_folder + "/" + file) for file in listdir(dog_folder) if file.endswith(".jpg")][:1500]

#give all the cat images a label of 0
cat_labels = [0 for _ in range(len(cat_images))]

#give all the dog images a label of 1
dog_labels = [1 for _ in range(len(dog_images))]

#combine the cat and dog images and labels into one list
images = cat_images + dog_images
labels = cat_labels + dog_labels

#resize the images
images = [cv2.resize(image, (244, 244)) for image in images]

#create a copy of the images that are converted to grayscale
gray_images = [cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in images]

In the cell above, we read in and labeled all of the cat and dog images, then resized each image so that the data is separated into two clearly labeled groups with equal dimensions for every image. We also converted each image to grayscale.

In [161]:
#read these images in as numpy arrays
images_array = np.array(gray_images)

#convert the labels to a numpy array
labels_array = np.array(labels)

#display the first 5 cat images using a subplot
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(gray_images[i], cmap="gray")
    plt.title(labels[i])
    plt.axis("off")
plt.show()

#display the first 5 dog images using a subplot
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(gray_images[i+1500], cmap="gray")
    plt.title(labels[i+1500])
    plt.axis("off")
plt.show()

We converted the lists of images and labels into numpy arrays and displayed the first five cat images and the first five dog images.

Converting the data to a table of 1-D image features¶

In [139]:
#linearize the images to create a table of 1-D image features
images_linear = images_array.reshape(images_array.shape[0], -1)

#display the shape of the linearized images
print(images_linear.shape)

#display the first 5 rows of the linearized images
print(images_linear[:5])

#visualize the first 5 images in the linearized images
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(images_linear[i].reshape(244,244), cmap="gray")
    plt.title(labels[i])
    plt.axis("off")
plt.show()
(3000, 59536)
[[113 104  93 ... 194 231 245]
 [  3   3   3 ... 142 141 137]
 [122 114  95 ... 146 143 142]
 [107 107 107 ... 115 111 112]
 [245 247 248 ... 181 183 181]]

We then reshape each image into a single row of integers representing its pixel values, so that each pixel is treated as an individual feature.
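The round trip between the 2-D pixel grid and the 1-D feature row is a lossless reshape. A minimal sketch with tiny 4×4 stand-ins for our 244×244 grayscale images:

```python
import numpy as np

# a stand-in batch of 3 tiny 4x4 "grayscale images"
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(3, 4, 4), dtype=np.uint8)

# flatten: each image becomes one row of 16 pixel features
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # (3, 16)

# the reshape is lossless: recover the original grid for display
restored = flat[0].reshape(4, 4)
assert np.array_equal(restored, batch[0])
```

The same pattern scales directly to our data, where each row has 244 × 244 = 59,536 pixel features.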

Some specifics of our dataset¶

In [140]:
#x is the original grayscale images
#y is the labels
#names is the names of the classes
X = images_linear
y = labels_array
names = ["cat", "dog"]

n_samples, n_features = X.shape
_, h, w = images_array.shape
n_classes = len(names)

print(np.sum(~np.isfinite(X)))
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("Original Image Sizes {} by {}".format(h,w))
0
n_samples: 3000
n_features: 59536
n_classes: 2
Original Image Sizes 244 by 244

Then, we print summary information about the image dataset (the number of samples, features, and classes, and the image dimensions) in order to get a sense of the scope of our data.

In [141]:
# helper plotting function
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

#shuffle the images and labels
from sklearn.utils import shuffle
images_linear, labels_array = shuffle(images_linear, labels_array, random_state=0)

#plot the first 18 images
plot_gallery(images_linear, labels_array, h, w)

Now, with our dataset created, we shuffle the cats and dogs within the numpy arrays in which they are stored, then print the first 18 cats and dogs in the set (in grayscale) to verify that the shuffle worked correctly.

Dimensionality Reduction with PCA¶

In [142]:
from sklearn.decomposition import PCA

n_components = 244
h, w = images_array[0].shape
print ("Extracting the top %d eigenfaces from %d faces" % ( n_components, X.shape[0]))

pca = PCA(n_components=n_components)

%time pca.fit(X.copy())

eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 244 eigenfaces from 3000 faces
CPU times: user 1min 51s, sys: 20.3 s, total: 2min 12s
Wall time: 20.9 s

We conducted PCA on the dataset, instructing the PCA function to reduce each image to 244 components. We chose 244 because each original image has 59,536 features and 244 is the square root of that number. Using the square root of the original feature count lets PCA reduce the dimensionality of each image much faster while retaining many of the same features as the original image. We expected this because each image is a perfect square, with each side 244 pixels long, and we verified the choice by testing many other values: the square root of the feature count was consistently the most effective for this form of dimensionality reduction.
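A complementary way to sanity-check a component count is to look at cumulative explained variance directly. The sketch below runs PCA via the SVD on a small random matrix standing in for our flattened image table; the random data and the 90% threshold are our own illustrative choices, not the criterion used above:

```python
import numpy as np

# stand-in for the flattened image table (rows = images, columns = pixels)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 16))

# PCA via the SVD of the mean-centered data
Xc = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var_ratio = S**2 / np.sum(S**2)
cum = np.cumsum(explained_var_ratio)

# smallest component count capturing 90% of the variance
k90 = int(np.searchsorted(cum, 0.90)) + 1
print(k90)
```

On real image data, where nearby pixels are highly correlated, the cumulative curve rises far more steeply than on random data, so a relatively small component count captures most of the variance.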

Visualizing the explained variance of PCA¶

In [143]:
# this is a scree plot
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })

plot_explained_variance(pca)

Afterwards, we made a scree plot to represent the results of our PCA calculations in order to visualize the data.

Displaying the eigenfaces from the PCA¶

In [144]:
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

We displayed 18 images that have been through PCA to visually verify that the PCA had effectively reduced each image to only the most significant features.

In [145]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    

idx_to_reconstruct = 50   
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))

plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA')
plt.grid(False)

We reconstruct an individual image based off of the images that have gone through PCA to see how well the PCA data reflects the original image.

Dimensionality Reduction with Randomized PCA¶

In [146]:
n_components = 244
print ("Extracting the top %d eigenfaces from %d faces" % ( n_components, X.shape[0]))

rpca = PCA(n_components=n_components, svd_solver='randomized')

%time rpca.fit(X.copy())
Extracting the top 244 eigenfaces from 3000 faces
CPU times: user 1min 49s, sys: 19.1 s, total: 2min 8s
Wall time: 18.8 s
Out[146]:
PCA(n_components=244, svd_solver='randomized')

We conducted randomized PCA on the dataset, again reducing each image to 244 components; as with standard PCA, 244 was chosen because each original image has 59,536 features and 244 is the square root of that number. Randomized PCA computes essentially the same decomposition as PCA, but approximates it from a randomized projection of the data rather than a full decomposition, which saves run time. We tested 244 against many other component counts and, just as with standard PCA, it was clearly the fastest and most effective number of dimensions to request.
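The idea behind the randomized solver can be sketched in a few lines of NumPy. This is our own simplified version of the randomized range-finder family of algorithms that sklearn's `svd_solver='randomized'` is based on; the oversampling amount, iteration count, and rank-5 test matrix are illustrative choices:

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_iter=2, seed=0):
    """Approximate rank-k SVD via a random projection (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # project onto a random subspace slightly larger than the target rank
    Y = A @ rng.normal(size=(A.shape[1], k + n_oversamples))
    for _ in range(n_iter):          # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)           # orthonormal basis for the sampled range
    # exact SVD of the much smaller projected matrix
    Ub, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]

# rank-5 test matrix: the approximation should recover it almost exactly
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 80))
U, S, Vt = randomized_svd(A, k=5)
err = np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A)
print(err)  # tiny relative error
```

Because all of the expensive linear algebra happens on the small projected matrix, the cost scales with the target rank rather than the full feature count, which is why the randomized solver is faster on our 59,536-feature images.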

Visualizing the explained variance of Randomized PCA¶

In [147]:
eigenfaces = rpca.components_.reshape((n_components, h, w))

plot_explained_variance(rpca)

Afterwards, we made a scree plot to represent the results of our random PCA calculations in order to visualize the data.

Displaying the Eigenfaces using RPCA¶

In [148]:
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

Afterwards, we displayed 18 images that have been through randomized PCA to visually verify that the RPCA had effectively reduced each image to only the most significant features.

Comparison of PCA and Randomized PCA¶

In this section, we reconstruct each image based on its pca and rpca components. We then compare the reconstructed images with the original images using the widget seen below in order to see the effectiveness of the two methods.

In [149]:
import ipywidgets as widgets
import warnings
# warnings.simplefilter('ignore', DeprecationWarning)
# warnings.simplefilter("always",DeprecationWarning)

def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = int(np.round(idx_to_reconstruct))  # ensure an integer index
    
    x_flat = images_linear[idx_to_reconstruct].reshape(1, -1)
    reconstructed_image = pca.inverse_transform(pca.transform(x_flat.copy())) 
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(x_flat.copy()))
    
    plt.figure(figsize=(15,7))
    
    plt.subplot(1,3,1) # original
    plt.imshow(x_flat.reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original")
    plt.grid(False)
    
    plt.subplot(1,3,2) # pca
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"Full PCA, {n_components} elements")
    plt.grid(False)
    
    plt.subplot(1,3,3) # randomized pca
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"Randomized PCA, {n_components} elements")
    plt.grid(False)
    
    
    
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,images_linear.shape[0]-1,1))

In order to further explore our data and to quantitatively show the difference between PCA and Randomized PCA, we began by analyzing the explained variance of each algorithm on the same dataset.

In [150]:
#show the explained variance of both PCA and randomized PCA
plot_explained_variance(pca)

plot_explained_variance(rpca)
In [151]:
#calculate and display the explained variance of PCA and randomized PCA
print("PCA explained variance: {}".format(np.sum(pca.explained_variance_ratio_)))

print("Randomized PCA explained variance: {}".format(np.sum(rpca.explained_variance_ratio_)))
PCA explained variance: 0.9024461919932214
Randomized PCA explained variance: 0.902469958774991

During our analysis of each method, we found the explained variances to be nearly identical, with randomized PCA coming out slightly higher.

Moving forward with this, we decided to time each method as well, in order to further measure the differences between the two.

In [160]:
print("PCA time: ") 
%time pca.fit(X.copy())

print(" ")

print("Randomized PCA time: ") 
%time rpca.fit(X.copy())
PCA time: 
CPU times: user 1min 47s, sys: 17.7 s, total: 2min 5s
Wall time: 20.7 s
 
Randomized PCA time: 
CPU times: user 1min 48s, sys: 17.2 s, total: 2min 5s
Wall time: 17.8 s
Out[160]:
PCA(n_components=244, svd_solver='randomized')

After timing each method, we found that randomized PCA was faster than standard PCA. That makes it the more attractive option for representing images with fewer components: it provided a slight increase in explained variance while also being noticeably faster.

In the end, for our prediction task we preferred randomized PCA for dimensionality reduction, as it was faster than standard PCA while also providing a slight increase in explained variance.

Feature Extraction with DAISY¶

In [153]:
from skimage.feature import daisy

#grab one image to demonstrate the descriptor layout
img = X[5].reshape((h, w))

features, img_desc = daisy(img, 
                           step=30, # distance between centers of the descriptor regions
                           radius=30, # radius of the descriptor regions
                           rings=2, # number of rings
                           histograms=8, # number of histograms
                           orientations=8, # number of orientations
                           visualize=True)

plt.imshow(img_desc)
plt.grid(False)

As seen in the diagram above, we performed DAISY feature extraction with the parameters shown, as we believed these settings covered our dataset's fine-grained features well while still leaving room for coarser features like the overall shape of the animal.

In [154]:
# create a function to take in the row of the matrix and return a new feature
def apply_daisy(row,shape):
    feat = daisy(row.reshape(shape), step=30, radius=30, rings=2, histograms=8, orientations=8, visualize=False)
    return feat.reshape((-1))

# apply the function to each row of the matrix
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
CPU times: user 4min 4s, sys: 1.59 s, total: 4min 5s
Wall time: 4min 5s
(3000, 6664)

This defines a function that applies the DAISY feature extractor to a single flattened image, then times the extraction as it is applied to every image in the dataset.
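The mechanics rely on `np.apply_along_axis`: each 1-D row of X is reshaped back into an image, the extractor runs on it, and the flattened descriptors are stacked into a new feature matrix. A minimal stand-in (with a trivial column-mean "extractor" in place of `daisy`, purely for illustration) shows the pattern:

```python
import numpy as np

def fake_extract(row, shape):
    """Stand-in for apply_daisy: reshape a flat row, summarize it per column."""
    img = row.reshape(shape)
    return img.mean(axis=0)          # one feature per image column

X_demo = np.arange(24, dtype=float).reshape(2, 12)   # two flat 3x4 "images"
feats = np.apply_along_axis(fake_extract, 1, X_demo, (3, 4))
print(feats.shape)  # (2, 4)
```

As with DAISY above, the output keeps one row per image, but the per-row feature length is set by the extractor rather than the pixel count.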

In [155]:
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
CPU times: user 3.69 s, sys: 565 ms, total: 4.25 s
Wall time: 517 ms

We then created a pairwise distance matrix over the DAISY features, to be used for nearest neighbor classification in the next cell.
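The lookup in the next cell reduces to an argmin over one row of this matrix, with the diagonal masked out so an image never matches itself. A tiny sketch with made-up 2-D feature vectors:

```python
import numpy as np

feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [5.0, 5.0]])

# full pairwise Euclidean distance matrix, like pairwise_distances returns
diff = feats[:, None, :] - feats[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# nearest neighbor of sample 0, excluding itself
row = dist[0].copy()
row[0] = np.inf
print(np.argmin(row))  # 2 -- the point at (0.1, 0.1)
```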

Nearest Neighbor Classification¶

In [156]:
#import imshow from matplotlib
from matplotlib.pyplot import imshow
import copy
# find closest image to current image
idx1 = 5
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.inf # don't pick the same image!
idx2 = np.argmin(distances)

plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
plt.imshow(X[idx1].reshape((h,w)), cmap='gray')
plt.title("Original Image")
plt.grid()

plt.subplot(1,2,2)
plt.imshow(X[idx2].reshape((h,w)), cmap='gray')
plt.title("Closest Image")
plt.grid()
In [157]:
import copy

from sklearn.metrics.pairwise import pairwise_distances

def closest_image(dmat_daisy, idx1):
    
    distances = copy.deepcopy(dmat_daisy[idx1,:]) 
    distances[idx1] = np.inf
    idx2 = np.argmin(distances)
    
    
    plt.figure(figsize=(15,10))
    plt.subplot(2,2,1)
    plt.imshow(X[idx1].reshape((h,w)), cmap='gray')
    plt.title("Original Image")
    plt.grid()

    plt.subplot(2,2,2)
    plt.imshow(X[idx2].reshape((h,w)), cmap='gray')
    plt.title("Closest Image (Daisy)")
    plt.grid()


    
from ipywidgets import fixed
widgets.interact(closest_image, idx1=(0, n_samples-1, 1), dmat_daisy=fixed(dist_matrix), __manual=True)
    
In [166]:
#show the first 5 images with their closest daisy image
for i in range(5):
    closest_image(dist_matrix, i)

for i in range(5):
    closest_image(dist_matrix, i+1500)

After analyzing the DAISY feature extraction method, we believe this approach would be useful for our prediction task, as the nearest neighbor results were very accurate. That accuracy came with a drawback, however: DAISY extraction on our dataset and parameters took much longer to complete than the other methods we tested. Even without perfect accuracy, we believe this method could still serve our prediction task well, though anyone using it would need to budget considerably more time for classification.
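Our accuracy judgment above is visual; one way it could be quantified is a leave-one-out 1-NN score computed straight from the distance matrix and labels. The sketch below uses a synthetic distance matrix and labels (our own assumption, not a result from our data):

```python
import numpy as np

def one_nn_accuracy(dist_matrix, labels):
    """Leave-one-out 1-NN accuracy from a precomputed distance matrix."""
    d = dist_matrix.astype(float).copy()
    np.fill_diagonal(d, np.inf)            # never match an image to itself
    nearest = np.argmin(d, axis=1)
    return float(np.mean(labels[nearest] == labels))

# synthetic check: two well-separated clusters should score perfectly
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
d = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
print(one_nn_accuracy(d, labels))  # 1.0
```

The same function could be applied to `dist_matrix` and `labels_array` to put a number on the DAISY comparison.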

Gabor Kernel Feature Extraction¶

In [158]:
import numpy as np
import cv2
import matplotlib.pyplot as plt

from matplotlib.pyplot import imshow

ksize = 10          # kernel size
sigma = 5           # standard deviation of the Gaussian envelope
theta = 1*np.pi/2   # orientation of the filter
lamda = 1*np.pi/4   # wavelength of the sinusoidal carrier
gamma = 1           # spatial aspect ratio
phi = 0.6           # phase offset

kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lamda, gamma, phi, ktype=cv2.CV_32F)

#grab one image and apply the kernel to it
img = X[5].reshape((h, w))
fimg = cv2.filter2D(img, cv2.CV_8UC3, kernel)

#enlarge the kernel for inspection, then show the filtered image
kernel_resized = cv2.resize(kernel, (400, 400))                    # Resize image
plt.imshow(kernel_resized)
plt.imshow(fimg, cmap='gray')
Out[158]:
<matplotlib.image.AxesImage at 0x7fab9af60a00>

We experimented with different parameter values for the Gabor kernel feature extraction method and found that the parameters above gave the best results for our dataset.
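For reference, a Gabor kernel is a Gaussian envelope multiplied by a sinusoidal carrier along a rotated axis. The NumPy version below is our own re-implementation for illustration; it may differ from `cv2.getGaborKernel` in normalization and exact grid handling (for instance, it always produces an odd-sized, symmetric grid):

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lamda, gamma, phi):
    """Gaussian envelope times a cosine carrier along the rotated x-axis."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lamda + phi)
    return envelope * carrier

k = gabor_kernel(10, sigma=5, theta=np.pi / 2, lamda=np.pi / 4, gamma=1, phi=0.6)
print(k.shape)  # (11, 11)
```

Reading the formula this way makes the parameter roles concrete: sigma sets the extent of the envelope, theta the orientation, lamda the stripe spacing, gamma the elongation, and phi the phase of the stripes.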

Nearest Neighbor Classification w/ Gabor Filters¶

In [159]:
# apply gabor filter to all images
def apply_gabor(row,shape):
    feat = cv2.filter2D(row.reshape(shape), cv2.CV_8UC3, kernel)
    return feat.reshape((-1))

# apply to entire data, row by row,
%time gabor_features = np.apply_along_axis(apply_gabor, 1, X, (h,w))
print(gabor_features.shape)

%time dist_matrix_gabor = pairwise_distances(gabor_features)

#perform closest image with gabor filter
def closest_image_gabor(dmat_gabor, idx1):
    
    distances = copy.deepcopy(dmat_gabor[idx1,:]) 
    distances[idx1] = np.inf
    idx2 = np.argmin(distances)
    
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    #turn the picture back to grayscale
    imshow(X[idx1].reshape((h,w)), cmap='gray')
    plt.title("Original:"+names[y[idx1]])
    plt.grid()

    plt.subplot(1,3,2)
    imshow(X[idx2].reshape((h,w)), cmap='gray')
    plt.title("Gabor Closest:"+names[y[idx2]])
    plt.grid()

    #for the third picture, show the closest picture with the gabor filter applied
    plt.subplot(1,3,3)
    imshow(gabor_features[idx2].reshape((h,w)), cmap='gray')
    plt.title("Gabor with filter:"+names[y[idx2]])
    plt.grid()
    
    
#create widget for gabor filter
from ipywidgets import fixed
widgets.interact(closest_image_gabor, idx1=(0, n_samples-1, 1),
                    dmat_gabor=fixed(dist_matrix_gabor),
                    __manual=True)

    
CPU times: user 2.47 s, sys: 215 ms, total: 2.69 s
Wall time: 2.39 s
(3000, 59536)
CPU times: user 31.7 s, sys: 2.56 s, total: 34.3 s
Wall time: 5.39 s
In [169]:
#show the first 5 images with their closest gabor image
for i in range(5):
    closest_image_gabor(dist_matrix_gabor, i)
    plt.show()    

Similar to the previous section on DAISY features, we analyzed the performance of our Gabor filter feature extraction using nearest neighbor classification. Through this classification, we found the accuracy to be higher than that of PCA but lower than that of DAISY feature extraction, with several apparent misclassifications. From a visual perspective, Gabor filter extraction seems much better at identifying the overall shape of animal faces, but this contributes to misclassifications when an animal's shape more closely resembles the other species.

Overall, we believe this method could still be useful for our prediction task, but the DAISY feature extraction method would be more useful, as it achieved higher accuracy even though it took longer to complete.